Advanced Corpus Solutions for Humanities Researchers
Authors
Abstract
This paper describes the design and implementation of an interface to corpora in 12 languages, based on an analysis of the needs of a diverse group of users: language teachers and students, (non-computational) linguists, and researchers in history and translation studies. We identified a set of requirements shared across the disciplines, as well as more specific requirements from the targeted user groups. The interface is designed to handle large-scale corpora of 20-500 million words.
Similar resources
Supporting Serendipitous and Focused Search
Humanities researchers are one example of people with complex information needs: they need advanced search engines to investigate their research questions. Much can be gained by combining research datasets, reusing tools and serendipitously discovering new insights for further research. Humanities researchers have different (large-scale) research datasets and tools, which are described differently ...
GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus
This paper introduces a software tool, GutenTag, which is aimed at giving literary researchers direct access to NLP techniques for the analysis of texts in the Project Gutenberg corpus. We discuss several facets of the tool, including the handling of formatting and structure, the use and expansion of metadata which is used to identify relevant subcorpora of interest, and a general tagging frame...
Vocabulary Lists for EAP and Conversation Students
Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...
A multi-level multimedia concordancer for spoken language corpora (Un concordancier multi-niveaux et multimédia pour des corpus oraux) [in French]
Concordances have always played an important role in the analysis of language corpora, for studies in humanities, literature, linguistics, translation and language teaching. However, very few of the available systems support multi-level queries against a richly-annotated, sound-aligned spoken corpus. The rapid growth in the development of spoken corpora, particularly for French, increases the n...
Enhancing Access to Media Collections and Archives Using Computational Linguistic Tools
In this paper, we outline the strategies, methodology, and infrastructure needed to bring advanced computational linguistic tools to researchers and archivists in the humanities. We discuss three use cases involving the application of the Language Application Grid (LAPPS), an open, web-based infrastructure providing interoperable access to hundreds of computational linguistic (CL) component web...